feat: Add Amazon S3 Vectors document store integration#3149

Open
dotKokott wants to merge 9 commits into deepset-ai:main from dotKokott:feature/amazon-s3-vectors-integration

Conversation

@dotKokott
Contributor

@dotKokott dotKokott commented Apr 13, 2026

Related Issues

  • deepset-ai#2110 — Amazon S3 Vectors (DocStore)

Proposed Changes:

Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.

Components:

  • S3VectorsDocumentStore — full DocumentStore protocol (write, count, filter, delete)
  • S3VectorsEmbeddingRetriever — embedding-based retrieval with server-side metadata filtering

Key design decisions:

  • Content stored as non-filterable metadata (AWS-recommended pattern for large text)
  • Cosine distance converted to similarity score (1 - distance) for Haystack convention
  • Blob data uses base64 encoding for round-trip fidelity
  • filter_documents() uses list_vectors(returnData=True, returnMetadata=True) with client-side filtering (warning logged) since S3 Vectors has no standalone filter API
  • Batch existence checks for DuplicatePolicy.SKIP/NONE (batches of 100)
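The distance-to-score conversion described above can be sketched as follows (the function name is illustrative, not the integration's actual helper):

```python
def cosine_distance_to_score(distance: float) -> float:
    """Convert a cosine distance returned by S3 Vectors into a
    Haystack-style similarity score, following the 1 - distance convention."""
    return 1.0 - distance

# A distance of 0.0 (identical direction) maps to a score of 1.0;
# a distance of 1.0 (orthogonal vectors) maps to 0.0.
```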

Known limitations (documented in README):

  • top_k capped at 100 (service limit)
  • query_vectors does not return embedding data
  • 40KB total metadata per vector, 2KB filterable
  • Only float32, cosine/euclidean, eventual consistency
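A caller-side guard for two of these limits might look like this minimal sketch (the limit values come from the list above; the helper names and the JSON-based size estimate are assumptions, not the integration's code):

```python
import json

MAX_TOP_K = 100                  # query_vectors service limit
MAX_METADATA_BYTES = 40 * 1024   # total metadata per vector

def clamp_top_k(top_k: int) -> int:
    """Cap top_k at the S3 Vectors service limit."""
    return min(top_k, MAX_TOP_K)

def check_metadata_size(metadata: dict) -> None:
    """Rough pre-flight check against the 40KB metadata limit."""
    size = len(json.dumps(metadata).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        msg = f"metadata is {size} bytes; limit is {MAX_METADATA_BYTES}"
        raise ValueError(msg)
```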

How did you test it?

  • 26 unit tests — serialization, score conversion, filter conversion, duplicate policy logic, document conversion (mocked boto3)
  • 12 integration tests — full lifecycle against live AWS S3 Vectors, with pytestmark credential guard for CI
  • hatch run test:all, hatch run fmt, hatch run test:types
  • Example script (examples/example.py) verified against live AWS

Notes for the reviewer

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.

Checklist

@github-actions github-actions Bot added the topic:CI and type:documentation labels Apr 13, 2026
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
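The Haystack-to-S3-Vectors filter conversion could look roughly like this minimal sketch (the operator mapping assumes S3 Vectors' MongoDB-style filter syntax with `$eq`/`$and` etc.; the integration's real converter handles more cases):

```python
# Map Haystack comparison operators onto S3 Vectors filter operators.
_OPS = {"==": "$eq", "!=": "$ne", ">": "$gt", ">=": "$gte",
        "<": "$lt", "<=": "$lte", "in": "$in", "not in": "$nin"}

def convert_filters(filters: dict) -> dict:
    """Translate a Haystack 2.x filter dict into S3 Vectors filter syntax."""
    if "conditions" in filters:
        # Logical node: AND / OR over nested conditions.
        op = "$and" if filters["operator"] == "AND" else "$or"
        return {op: [convert_filters(c) for c in filters["conditions"]]}
    # Comparison node: strip the "meta." prefix Haystack uses for metadata.
    field = filters["field"].removeprefix("meta.")
    return {field: {_OPS[filters["operator"]]: filters["value"]}}
```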
…rkflow

- boto3 lower bound set to 1.42.0 (when s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name)
- Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern)
- Assert full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling
the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior)
After: 26 unit tests (all testing our code) + 12 integration tests
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare
  with haystack.utils.filters.document_matches_filter (same utility used by
  InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and
  date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by
  reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make filter_documents() warning conditional on filters actually being
  provided (no warning when listing all documents)
@dotKokott dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from 1df9666 to 90c4977 Compare April 13, 2026 13:28
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:39
@dotKokott dotKokott requested a review from a team as a code owner April 13, 2026 13:39
@dotKokott dotKokott requested review from anakin87 and removed request for a team April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as draft April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:40
@dotKokott
Contributor Author

CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a pytestmark = pytest.mark.skipif(not _aws_credentials_available(), ...) guard so they silently skip (0 collected), but this means:

  • Integration tests never actually run in CI — only locally by developers with AWS credentials
  • The "combined" coverage badge will just reflect unit test coverage

What needs to happen

The workflow should match the amazon_bedrock.yml pattern — add an OIDC role assumption step and gate the integration test run on its success:

# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry

Prerequisites (maintainer action required)

  1. Create an IAM role with s3vectors:* permissions (scoped to haystack-test-* bucket names)
  2. Configure the role's trust policy for GitHub OIDC (token.actions.githubusercontent.com)
  3. Add the role ARN as a repository secret (e.g. AWS_S3_VECTORS_CI_ROLE_ARN)

@anakin87
Member

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

@dotKokott
Contributor Author

dotKokott commented Apr 15, 2026

> @dotKokott I'll try to take a look in the next few days.
>
> Have you tried the integration yourself in a real-world setting with AWS?

I have run all the integration tests and examples against my AWS account.

However, I did not try any large datasets. That might be the next thing to validate: whether this works as expected under real load.

Member

@anakin87 anakin87 left a comment


I left some initial comments.
Will take a better look soon

Comment thread .github/workflows/amazon_s3_vectors.yml
@@ -0,0 +1,207 @@
# amazon-s3-vectors-haystack
Member


We want to have a very minimal README (see https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/README.md)

This info is useful but we'll put it in docs

Contributor Author


Sounds good. Should I leave it in the README until we know where you will put the info? Or would you like me to minimize the README already?

Member


yes let's minimize it, but please save this info somewhere to be re-used in docs

Matches the pattern used by the amazon_bedrock workflow:
- top-level id-token: write permission
- AWS_REGION env var
- configure-aws-credentials step (skipped on fork PRs and dependabot)
- integration tests gated on successful auth
Matches the repo convention used across other integrations.
Comment on lines +264 to +267
for doc in batch:
    if doc.embedding is None:
        msg = f"Document '{doc.id}' has no embedding. S3VectorsDocumentStore requires embeddings."
        raise DocumentStoreError(msg)
Member


I would do this check not per batch but upfront for all docs, so we raise an error before any writing starts.
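The suggested upfront check could be sketched like this (using a stand-in dataclass; the real store iterates Haystack `Document` objects and raises `DocumentStoreError`):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Doc:  # stand-in for haystack.Document
    id: str
    embedding: Optional[List[float]] = None

def validate_embeddings(documents: List[Doc]) -> None:
    """Collect all documents missing embeddings and fail before any write."""
    missing = [d.id for d in documents if d.embedding is None]
    if missing:
        msg = f"Documents {missing} have no embedding. S3VectorsDocumentStore requires embeddings."
        raise ValueError(msg)
```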

from haystack_integrations.components.retrievers.amazon_s3_vectors import S3VectorsEmbeddingRetriever
from haystack_integrations.document_stores.amazon_s3_vectors import S3VectorsDocumentStore


Member

@anakin87 anakin87 Apr 23, 2026


For integration tests of Document Stores, we should inherit from Haystack's DocumentStoreBaseExtendedTests, which already contains the necessary tests


Labels

topic:CI, type:documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Amazon S3 Vectors (DocStore)

2 participants